Goto

Collaborating Authors

 phylogenetic tree


Heat and Matérn Kernels on Matchings

Eremeev, Dmitry, Said, Salem, Borovitskiy, Viacheslav

arXiv.org Machine Learning

Applying kernel methods to matchings is challenging due to their discrete, non-Euclidean nature. In this paper, we develop a principled framework for constructing geometric kernels that respect the natural geometry of the space of matchings. To this end, we first provide a complete characterization of stationary kernels, i.e. kernels that respect the inherent symmetries of this space. Because the class of stationary kernels is too broad, we specifically focus on the heat and Matérn kernel families, adding an appropriate inductive bias of smoothness to stationarity. While these families successfully extend widely popular Euclidean kernels to matchings, evaluating them naively incurs a prohibitive super-exponential computational cost. To overcome this difficulty, we introduce and analyze a novel, sub-exponential algorithm leveraging zonal polynomials for efficient kernel evaluation. Finally, motivated by the known bijective correspondence between matchings and phylogenetic trees-a crucial data modality in biology-we explore whether our framework can be seamlessly transferred to the space of trees, establishing novel negative results and identifying a significant open problem.



Geometry-Aware Adaptation for Pretrained Models

Neural Information Processing Systems

Machine learning models--including prominent zero-shot models--are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them.


D AB C AB CD ABC D ABC D AB CD D AB CD AB CD AB CD assign

Neural Information Processing Systems

For example, one can simply use the splits of the edges on phylogenetic trees, and assign parameters for each split inSr. Let π be a permutation ofS and Sc, that isπ(S) is a rearrangement ofS and π(Sc) is a rearrangementofSc.



Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models

Wang, Linfeng, Campino, Susana, Clark, Taane G., Phelan, Jody E.

arXiv.org Artificial Intelligence

Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.





Hyperbolic Genome Embeddings

Khan, Raiyan R., Chlenski, Philippe, Pe'er, Itsik

arXiv.org Artificial Intelligence

Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets--the Transposable Elements Benchmark--which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning. Our code and benchmark datasets are available at https://github.com/rrkhan/HGE.